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We present a Markov chain Monte Carlo algorithm for local alignments of nucleotide sequences 
aiming to infer putative transcription factor binding sites, referred to as the quadratic programming 
sampler. The new motif finder incorporates detailed biophysical modeling of the transcription factor 
binding site recognition which arises an intrinsic threshold discriminating putative binding sites from 
other/background sequences. 

We validate the principal functioning of the algorithm on a sample of four promoter regions from 
Escherichia coli. The resulting description of the motif can be readily evaluated on the whole genome 
to identify new putative binding sites. 
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I. INTRODUCTION 

Transcription factors (TFs) are DNA binding proteins 
with regulatory effects. They may either independently 
or in an interplay with other proteins activate or re- 
press the expression of genes related to the sequence 
they bind to, as illustrated in Figure [I] In this simpli- 
fied picture, they act by either facilitating or impeding 
the recruitment of RNA polymerase holoenzymes, pro- 
tein complexes responsible for the transcription of DNA 
to RNA. Information on exact locations and sequences 
of TF binding sites, typically 8-15 nucleotides, not only 
reveals which genes may or may not be controlled by a 
specific TF, thus permitting the construction of networks 
of genetic interaction, but is also indispensable when pre- 
dicting novel putative binding sites with sequence motifs 
defined by the known examples. Exact binding sequences 
are often still unknown as typically in Escherichia coli, 
where roughly 70 of a total of 231 activating and re- 
pressing TFs have experimentally verified binding motifs 
Pfl [2]. In eukaryotes the picture is, as expected, worse: 
the commercial database TRANSFAC® contains in its 
most recent version 10018 entries for eukaryotic TFs, of 
which just 834 have reported binding motifs [3]. 

In principle, sequences associated to coregulated or 
even homologous genes can be used to infer putative TF 
binding sites by merely aligning those sequences. Sev- 
eral methods have been proposed to perform this task 
efficiently, see [4] for a summary of the most popular ap- 
proaches. In practice, however, each of these methods 
has its flaws [5j |6] . 

The approach we present here is grounded on the rep- 
resentation of TFs by free energy matrices as developed 




FIG. 1: RNA polymerase (RNAP) activity can be controlled 
by activating (A) and/or repressing (R) transcription factor 
proteins. Combinations of binding sites for A and R lead to 
more complex schemes of regulation. 

in QPMEME by Djordjevic et al. [7]. This represen- 
tation yields the contributions of specific nucleotides to 
the total free energy of interaction between the TF and 
a sequence of DNA. Energies are scaled in terms of the 
chemical potential, rendering an intrinsic binding thresh- 
old which simplifies the task of distinguishing possible 
binding sites from the background on genomic sequences. 

II. METHOD 

We proceed by reviewing a simple model of TF-DNA 
binding and the representation of binding site motifs be- 
fore discussing the probabilistic model on which our ap- 
proach is based, as well as the details of the algorithm. 



A. TF-DNA Binding 
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The interaction between a TF and a specific sequence 
of DNA can be written as a pair of ordinary differen- 
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tial equations, describing the variation of bound and free 
concentrations of both reactants. Such a model depends 
on the reaction rates for binding and dissociation of TFs 
with DNA, symbolically stating 



bind 

TF + DNA ^ TF o DNA 



(i) 



with equilibrium constants for binding and dissociation 
^bind an d i^diss ? respectively. In a system with many par- 
ticles, the equilibrium concentrations of free TFs, specific 
DNA sequences and bound TF o DNA complexes can be 
related by the Arrhenius equation 



if bind = [TF o DNA] 
K diss [TF] [DNA] 



(2) 



where /3 is an inverse temperature and E(S) stands for 
the free energy of binding the TF to DNA with a specific 
sequence S. Accepting the probability for the TF binding 
the sequence S to be be given by 



[TF o DNA] 



[TF o DNA] + [DNA] ' 



it follows 



e 0[E(S)-n] 



(3) 



(4) 



with the chemical potential /i, relating abundance of the 
TF and affinity to its binding sites. The chemical poten- 
tial is given by 



V = k B T log [TF] + C, 



(5) 



up to the additive constant C. 



B. Binding Site Motifs 

To define a binding motif from a collection of known 
binding sites Si, ... , Sat, of length L each, the construc- 
tion of a matrix w V i containing statistical weights for the 
occurrence of a nucleotide n at position i in the motif is 
usually adopted [8]. 



f i 

w ni = log — with f ni 



7V + 4 



(6) 



are constructed by counting the occurrences of nu- 
cleotide 77 at position i in each of the binding sites and 
comparing the thus defined frequencies f v i to the proba- 
bilities p v with which to expect 77 in the sequence. Those 
probabilities can be deduced from the whole genome in 
question, from shorter regions containing the binding 
sites or even just from the latter. f ni usually takes into 
account the error due to the finite amount of sequences in 
the collection by adding pseudocounts |8 , 9 to the occur- 
rence counts. The weight matrix construction can then 



be used to evaluate the information content of a motif 
denoted by w 



rji 



(7) 



a measure for the dissimilarity between the motif and 
random sequences stemming from the probabilistic model 
defined by p n . Further, w V i can be applied to find pu- 
tative binding sites in a genome. Each subsequence of 
length L is thus associated to an information score [TO] , 
describing the likelihood of that sequence to belong to the 
set of binding sites. How to chose a threshold score dis- 
criminating putative binders from non-binders, however, 
remains an open question in the weight matrix approach 
[threshold] . 

A more subtle description of binding motifs by free 
energy matrices [7] addresses the problem of finding a 
threshold by inverting the interpretation of a binding mo- 
tif. Instead of describing similarities in a set of sequences, 
one attempts to model the requirements of a sequence to 
be able to bind a specific TF, now itself represented by 
the motif. The construction of such energy matrices e V i 
is based upon the assumption that binding motifs repre- 
sented by £ should maximize the probability of recovering 
the set 9JI = {Mi, M2, . . . , Mn} of known binding sites 
from an ensemble of random sequences, while the prob- 
ability of identifying binding sites on unrelated random 
sequences is minimized. This can be performed maximiz- 
ing the likelihood 



&[m] = n pr (^(s) n i 1 



Pv(S')P E (S')} , (8) 



with probabilities Pr(M) of generating a binding site se- 
quence M and binding probabilities P £ (M) for a TF to 
bind to this sequence. 

Maximizing £ can be shown to be equivalent to mini- 
mizing the variance <j 2 e of free energies resulting from the 
TF (e) binding to random sequences [7 , leading to an 
optimal e by solving 

argmincr^ subject to E £ (M) < Ro , (9) 

£ 

where Ro is a threshold free energy defining binding sites. 

In the inference method described by Djordjevic et at. 
[7], s is efficiently approximated in the low temperature 
limit (3 — > 00 by quadratic programming. The elements 
of the energy matrix are shifted by the mean free energy 
of the TF being bound to random sequences (E) £ and 
rescaled by the absolute value of its chemical potential 
p. The evaluation of on a specific nucleotide sequence 
M = (ai, a<2, • • • , ol w ) gives the at first sight somewhat 
cumbersome result 



R £ (M) = ^2s ail = 



E £ (S) - (E) e 



(E)e 



(10) 



where E £ (M) is the free energy of a TF associated to a 
TF e binding to M. Yet this representation has a major 
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advantage as compared to weight matrices: the chemical 
potential discriminates between strong and weak bind- 
ing sites and since e^i is directly inferred in terms of /i, 
the threshold is implicitly given as Ro = — 1. All se- 
quences with R £ (M) < Rq are thus presumably strong 
binding sites, while R £ (M) > Ro denotes weak and non- 
binders. More explicitly, the probability to find a specific 
TF bound to motif sequence M is 



Pe(M) 



1 



I + eP[Re(M)+i] 



with the rescaled inverse temperature 



\»-(E)e\ 

knT 



(11) 



(12) 



For further details we also refer to [7J [TTJ [12] and ref- 
erences therein. Note, however, that f3 remains a free 
parameter as long as estimates for the average energy 
and the chemical potential are missing. Varying (3 does 
obviously not change the qualitative result in ( 11 ) stating 
P £ > 0.5 if R £ > i?o ? but will lead to a sharper discrimi- 
nation of binding sequences from non-binding ones. 



C. Probabilistic Model 

Let us introduce a first order Markov model for the ge- 
nomic background with conditional probabilities for the 
generation of a sequence M = (ai, a<2, . . . , a w ) written 

as 



Pr(7? | v) = Pr(find r] preceded by v) 



(13) 



where r] and v represent single nucleotides. The probabil- 
ity of finding M among random sequences is thus given 

by 



Pr(M) = JJPr(aj | a 4 _i 

understanding the boundary condition 
Pr(ai | a ) = Pr(ai) . 



(14) 



(15) 



Adopting a more compact notation, we introduce the pas- 
sage matrices Wi(/3) of probabilities for a TF to be bound 
to a site featuring the pair of nucleotides vr) at position % 

= k* exp(-^) Pr(C I V) • (16) 

C 

Products n^(^) °f P^sage matrices apparently yield 
the probabilities of coming across TF-DNA hybrids of 
corresponding length. Consequently, the generating 
function for a motif sequence bound by e can be writ- 
ten as trace of the matrix product 



3M = tv[f[w i (P) 



(17) 



from which common statistical quantities describing 
binding of the TF to DNA can be derived. Of special 
interest is clearly the variance of free energies of a TF 
binding to random sequences, expressed as second deriva- 
tive of the generating function 



d 2 



log3 e (/3) 



0=0 



Evaluating this expression, Djordjevic et al. 
to solve for e by minimizing the variance [Jj. 



(18) 
show how 



D. Monte Carlo Sampling 

In the set of TV sequences {S±, . . . , Sn}, e.g. promoter 
regions of co-regulated genes or upstream regions of ho- 
mologous genes, we want to identify locally conserved 
subsequences of length w, supposedly sharing common 
TF binding sites. The alignment of the motif sequences 
Mi is represented by their positions on the respective 
sequence S{. Let us first introduce the alignment prob- 
ability distribution P £ (x) for a binding motif at position 
x on sequence S = (a±,a2, . . . , oll), which is constructed 
from the binding probabilities P £ by setting 



Pe(x) 



Peijox, a x+ i, ... , a x+w )) 

J2 X '=1 + Pe{{0ix',0i x '+l, ■ . ■ ,a x ' +w )) 



(19) 




. , ttcgaTCTGAGTCAIGGgccgagcgattgcgtcgaccjau. . 
, , gtcgatccaactaqqcTCGGACTCOTAGtaggctagttca. . 
, , aaCCGAACTCATCGgttagttaattgaatcgctatcgata. . 

, . ttgtgcgagtL-LjdLni L L uj LqGTTAAGTCGTGTttgcgag. . 

s 



1 TCTGAGTCATGA 

2 TCGGACTCCTAA 

3 CCGAACTCATCA 

N GTTAAGTCGTGT 



M 




FIG. 2: Schematics of the QPS algorithm taking N input 
sequences Si, ... , Sn with initialisation (i), motif extraction 
(m), matrix computation (c) and evaluation (e) steps. S2 is 
highlighted being retained for updating. 

The inference of an optimal local alignment is accom- 
plished by a standard Monte Carlo method following the 
procedure 



(i) assign random alignment positions a^, i = 1, 



(m) extract sequence motif 9JI of N — 1 sequences, ex- 
cluding Si where is to be updated 

(c) compute energy matrix e of the sequence motif M 

(e) evaluate e on the excluded sequence Si using R £ 
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Iteratively updating the alignments, we sample the dis- 
tribution of alignment positions until reaching stationar- 
ity on all of the sequences Si . The evolution of the distri- 
bution on biological sequences when inferring a binding 
motif of length 15 is shown in figure [3] 

The sampling is performed at finite temperature ((3 > 
0) in equation (11). Varying the rescaled inverse temper- 
ature during sampling allows to define a simple annealing 
schedule with stronger discrimination of possible binding 
from non-binders in the actual model as f3 grows. 



III. RESULTS AND DISCUSSION 

We validated the functionality of QPS on a small set of 
coregulated promotors in in Escherichia coli consisting 
of aceBAKp, icdAp, pckAp, and ptsHp. Each region 
contains an experimentally known binding site for the 
fructose repressor protein FruR, which we tried to infer. 
The heatmaps in figure [3] illustrate the development of 



region 



sequence 



aceBAK 24 CCTCATGCGCTTCTG 
icdA 49 GCTGAATCGCTTAAC 
pckA 7 CCCAAAGCGCCTTTT 
ptsH 66 GCTGAATCGATTTTA 



/ +0.3768 
+0.3890 
+0.3890 
-0.0761 
-1.1689 
-0.7724 
+0.3890 
+0.3890 
+0.3890 
+0.1072 
+0.3890 
+0.3890 
-0.0254 
-0.0254 

\ +0.1072 



-0.2691 
+0.4208 
+0.4208 
-0.2755 
+0.4208 
+0.4208 
-0.4408 
+0.4208 
-1.1371 
+0.4208 
+0.4208 
+0.4208 
+0.4208 
+0.4208 
+0.0243 



-0.4809 
-1.1658 
-0.0730 
-0.0043 
+0.3921 
+0.3921 
+0.3921 
-1.1658 
+0.3921 
-0.8839 
-0.0730 
+0.3921 
-0.0043 
+0.3921 
-0.0223 



+0.3768 \ 

+0.3596 

-0.7332 

+0.3596 

+0.3596 

-0.0369 

-0.3367 

+0.3596 

+0.3596 

+0.3596 

-0.7332 

-1.1983 

-0.3874 

-0.7839 

-0.1056 J 



10 20 30 40 50 60 70 80 

alignment position 



TABLE I: QPS alignment of the four regions sharing a FruR 
binding site including the resulting energy matrix of the cor- 
responding TF. 



FIG. 3: Heat maps of the evolution of alignment position 
probability distributions on the promoter regions of different 
operons in Escherichia coli. The regions contain one known 
TF binding site for FruR each and were aligned by QPS. At 
iteration 14, the distributions have reached their stationary 
form. 



(m') draw new alignment position a$ from the alignment 
probability distribution P £ (a) and iterate with (c), 

Figure [2] illustrates the procedure in which each itera- 
tion draws a new alignment on the skipped sequence. 



the alignment probability distribution P £ when iterating 
with QPS and one of the obtained alignments is presented 
in table [l| We settled for initiating the algorithm with 
(3 = 20 and allowing for subsequent linear augmentation 
as the sampling proceeds, 

Pk~k (20) 

with sampling iteration k. This corresponds approxi- 
mately to an annealing schedule ~ T _1 for the tempera- 
ture. 

Exact binding sites remaining unknown for a vast 
number of TFs, and it is an interesting problem to try 
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to infer a binding motif by aligning a set of sequences 
which are supposed to share sites for a specific TF. A 
wide range of different approaches have been developed 
[U H3] ; greedy pattern search algorithms context 
free grammar constructors [15], and several statistical 
methods [HUE], to cite but just a small selection. Still 
it appears that no single method is capable of identifying 
motifs in a reliable way [5j [18] and more recent ap- 
proaches tend to combine several algorithms [4j [19] to get 
a certain degree of cross-validation between individual 
methods. The method we present has been conceptually 
verified on a small sample of Escherichia coli promoter 
regions and might prove useful in combination with 
other approaches. The advantage of our algorithm is 
that it makes direct use of a biophysical representation 
of the TF. This representation is provided as result and 



can be readily applied to predict yet unknown binding 
sites elsewhere on the genome. 

The here described algorithm has been implemented 
in C++ and is publicly available under the GPL on 
http : //www . esc . kth . se/~a f d/qps/ . 
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